Assessing the Factors That Affect Red Wine Quality
========================================================
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Across the 1,599 records in the dataset, quality scores range from 3 to 8. No wine received a very high score (10) or a very low score (1).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.274 2.500 5.000
The histogram shows that most wines have a residual sugar level between 1.5 and 2.5 g/dm^3.
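The second summary above has a maximum of 5.0, which suggests that the upper tail was trimmed before summarizing, although the exact filter is not shown. A minimal sketch of one way to do this, assuming a cutoff at the 95th percentile. The first six values are the residual sugar readings from the data preview; the seventh is added here only to stand in for an outlier.

```r
# Residual sugar (g/dm^3): the six preview rows, plus 15.5
# (the maximum from the full summary) as an illustrative outlier.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 15.5)

# Assumed cutoff: keep values at or below the 95th percentile.
cutoff <- quantile(sugar, 0.95)
trimmed <- sugar[sugar <= cutoff]

summary(sugar)    # the mean is pulled up by the outlier
summary(trimmed)  # the mean sits close to the median
```

Trimming this way changes the summary statistics only; the underlying data frame is left untouched.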
ggplot(red_wine, aes(alcohol)) +
geom_histogram(binwidth = 0.1) +
geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
geom_vline(xintercept = mean(red_wine$alcohol), color = 'coral')
summary(red_wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(red_wine, aes(x = chlorides)) +
geom_histogram() +
xlim(quantile(red_wine$chlorides, 0.05), quantile(red_wine$chlorides, 0.95)) +
xlab("chlorides (middle 90%)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 158 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(subset(red_wine$chlorides,
red_wine$chlorides < quantile(red_wine$chlorides, 0.95)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07800 0.07914 0.08800 0.12600
ggplot(red_wine, aes(x=density)) +
geom_density() +
stat_function(linetype = 'dashed',
color = 'royalblue',
fun = dnorm,
args = list(mean = mean(red_wine$density), sd = sd(red_wine$density)))
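The dataset contains 1,599 records, each with 12 attributes.
The main question is which factors drive differences in red wine quality. However, the quality scores in the data only range from 3 to 8, so there are no exceptionally good or exceptionally bad wines. The mean quality score is 5.636.
So far only single variables have been explored, so it is not yet possible to tell which factors affect red wine quality.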
round(cor(red_wine), 3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.256 0.672
## volatile.acidity -0.256 1.000 -0.552
## citric.acid 0.672 -0.552 1.000
## residual.sugar 0.115 0.002 0.144
## chlorides 0.094 0.061 0.204
## free.sulfur.dioxide -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.113 0.076 0.036
## density 0.668 0.022 0.365
## pH -0.683 0.235 -0.542
## sulphates 0.183 -0.261 0.313
## alcohol -0.062 -0.202 0.110
## quality 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## quality 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide 1.000 0.071 -0.066 0.043 -0.206
## density 0.071 1.000 -0.342 0.149 -0.496
## pH -0.066 -0.342 1.000 -0.197 0.206
## sulphates 0.043 0.149 -0.197 1.000 0.094
## alcohol -0.206 -0.496 0.206 0.094 1.000
## quality -0.185 -0.175 -0.058 0.251 0.476
## quality
## fixed.acidity 0.124
## volatile.acidity -0.391
## citric.acid 0.226
## residual.sugar 0.014
## chlorides -0.129
## free.sulfur.dioxide -0.051
## total.sulfur.dioxide -0.185
## density -0.175
## pH -0.058
## sulphates 0.251
## alcohol 0.476
## quality 1.000
ggplot(red_wine, aes(x = alcohol, y = quality)) +
geom_point()
ggplot(red_wine, aes(x = alcohol, y = quality)) +
geom_jitter(alpha = 0.25) +
geom_smooth(method = "lm")
The plot suggests a moderate positive correlation between wine quality and alcohol content (r = 0.476).
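For reference, the coefficient reported by `cor()` is the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. A minimal sketch using only the six rows from the data preview, so the result is illustrative and is not the full-data value of 0.476:

```r
# Alcohol and quality for the six rows shown in the data preview.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Built-in Pearson correlation.
r_builtin <- cor(wine_head$alcohol, wine_head$quality)

# The same quantity from its definition: cov(x, y) / (sd(x) * sd(y)).
r_manual <- cov(wine_head$alcohol, wine_head$quality) /
  (sd(wine_head$alcohol) * sd(wine_head$quality))

print(c(builtin = r_builtin, manual = r_manual))
```

Both values agree, which confirms that `cor()` uses Pearson's method by default.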
ggplot(red_wine, aes(x = residual.sugar, y = quality)) +
xlim(0, quantile(red_wine$residual.sugar, 0.95)) +
xlab("residual sugar (bottom 95%)") +
geom_jitter(alpha = 0.15)
## Warning: Removed 81 rows containing missing values (geom_point).
ggplot(red_wine, aes(x = volatile.acidity, y = quality)) +
geom_jitter(alpha = 0.25) +
geom_smooth(method = 'lm')
ggplot(red_wine, aes(x = fixed.acidity, y = pH)) +
geom_point(alpha = 0.25) +
geom_smooth(method = 'lm')
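Of all the attributes, alcohol content has the strongest correlation with the quality score (0.476). Quality also has a fairly strong negative correlation with volatile acidity (-0.391): the higher the volatile acidity, the lower the quality tends to be.
Density and alcohol content have a fairly strong negative correlation (-0.496). This surprised me, probably because I know very little about the composition of red wine. A likely explanation is that ethanol is less dense than water.
The negative correlation between fixed acidity and pH (-0.683) is easier to guess: the higher the acidity, the lower the pH.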
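The line drawn by `geom_smooth(method = 'lm')` in the fixed acidity plot is an ordinary least-squares fit, so its slope can be inspected directly with `lm()`. A minimal sketch on the six preview rows only; the slope on the full data will differ, but the sign is the point here.

```r
# Fixed acidity and pH for the six rows shown in the data preview.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Least-squares fit of pH on fixed acidity.
fit <- lm(pH ~ fixed.acidity, data = wine_head)

# A negative slope means pH falls as fixed acidity rises.
slope <- unname(coef(fit)["fixed.acidity"])
print(slope)
```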
ggplot(red_wine, aes(x = alcohol, y = quality, color = volatile.acidity)) +
geom_jitter() +
scale_color_gradient(high = 'blue', low = 'green')
ggplot(red_wine, aes(x = alcohol, y = quality, color = citric.acid)) +
geom_jitter() +
scale_color_gradient(high = 'green', low = 'blue')
ggplot(red_wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
geom_jitter() +
scale_color_brewer()
ggplot(red_wine, aes(x = alcohol, y = quality, color = citric.acid)) +
geom_jitter() +
scale_color_gradient(high = 'red', low = 'blue')
ggplot(red_wine, aes(x = alcohol, y = density, color = residual.sugar)) +
geom_jitter() +
scale_color_gradient2(limits=c(0, quantile(red_wine$residual.sugar, 0.95)),
midpoint = median(red_wine$residual.sugar))
The plots show that quality tends to rise as alcohol content increases and volatile acidity decreases.
ggplot(red_wine, aes(alcohol)) +
geom_histogram(binwidth = 0.1) +
geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
annotate('text',
x = median(red_wine$alcohol) - 0.35,
y = 120,
label = paste('median\n(', median(red_wine$alcohol), ')', sep = ''),
color = 'royalblue') +
geom_vline(xintercept = mean(red_wine$alcohol), color = 'red') +
annotate('text',
x = mean(red_wine$alcohol) + 0.35,
y = 120,
label = paste('mean\n(', round(mean(red_wine$alcohol), 2), ')', sep = ''),
color = 'red') +
xlab("Alcohol (%)") +
ylab("Count")
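Because the data show a correlation between alcohol content and quality, this plot looks at how the wines are distributed across alcohol levels. The mean (10.42) is greater than the median (10.2), so the distribution is skewed to the right.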
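Comparing the mean and the median is a quick check on the direction of skew: a long right tail pulls the mean above the median. A minimal sketch with hypothetical alcohol values chosen for illustration (they are not the real data):

```r
# Hypothetical alcohol values (%) with one large value in the right tail.
alcohol <- c(9.4, 9.5, 9.8, 10.2, 10.5, 11.1, 14.9)

alcohol_mean <- mean(alcohol)
alcohol_median <- median(alcohol)

# TRUE when the mean exceeds the median, a sign of right skew.
right_skewed <- alcohol_mean > alcohol_median
print(c(mean = alcohol_mean, median = alcohol_median))
print(right_skewed)
```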
ggplot(red_wine, aes(x = alcohol, y = quality)) +
geom_jitter(alpha = 0.1, height = 0.48, width = 0.025) +
geom_smooth(method = "lm") +
ggtitle("Quality vs Alcohol Content") +
xlab("Alcohol (%)") +
ylab("Quality (0-10)")
This plot shows the positive relationship between alcohol content and quality.
ggplot(red_wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
geom_jitter() +
scale_color_brewer(name = "Quality") +
ggtitle("Quality by Volatile Acidity and Alcohol") +
xlab("Alcohol (%)") +
ylab("Volatile Acidity (g/L)")
This plot shows that higher-quality wines tend to have higher alcohol content and lower volatile acidity.
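The same pattern can be checked numerically by averaging both variables within each quality level. A minimal sketch using `aggregate()` on the six preview rows only; with so few rows this is an illustration of the approach, not evidence in itself.

```r
# Alcohol, volatile acidity and quality for the six preview rows.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Mean alcohol and mean volatile acidity for each quality score.
by_quality <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                        data = wine_head, FUN = mean)
print(by_quality)
```

Even in this small sample, the single wine rated 6 has a higher alcohol content and lower volatile acidity than the average of the wines rated 5.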